43 research outputs found

    Logistic Ensemble Models

    Predictive models that are developed in a regulated industry or for a regulated application, such as the determination of creditworthiness, must be interpretable and “rational” (e.g., improvements in basic credit behavior must result in improved creditworthiness scores). Machine learning technologies provide very good performance with minimal analyst intervention, so they are well suited to a high-volume analytic environment, but the majority are “black box” tools that provide very limited insight into the key drivers of model performance or predicted model output values. This paper presents a methodology that blends one of the most popular predictive statistical modeling methods with a core model enhancement strategy found in machine learning. The resulting prediction methodology provides solid performance with minimal analyst effort while providing the interpretability and rationality required in regulated industries

    Binary Classification on Past Due of Service Accounts using Logistic Regression and Decision Tree

    This paper aims to predict businesses’ past-due status on service accounts and to determine the variables that affect the likelihood of repayment. Two binary classification approaches, logistic regression and a decision tree, were fitted and compared. Both perform very well in terms of accuracy. However, the decision tree uses only 10 predictors and reaches an accuracy of 96.69% on the validation set, while logistic regression includes 14 predictors and reaches an accuracy of 94.58%. Because false negatives are a major concern in the financial industry, the decision tree is the better option on the given dataset, as it produces relatively fewer false negatives. Accuracy, false positives, and false negatives are all very important criteria in model selection and evaluation. Decision making should rely more on the research purpose than on the exact values of these criteria

    An Analysis of Accuracy using Logistic Regression and Time Series

    This paper analyzes the accuracy rates for logistic regression and time series models. It also examines a relatively new performance index that takes into consideration the business assumptions of credit markets. Although prior research has focused on evaluation metrics, such as AUC and Gini index, this new measure has a more intuitive interpretation for various managers and decision makers and can be applied to both Logistic and Time Series models

    A Comparison of Machine Learning Techniques and Logistic Regression Method for the Prediction of Past-Due Amount

    The aim of this paper is to predict a past-due amount using traditional and machine learning techniques: Logistic Analysis, k-Nearest Neighbor and Random Forest. The dataset to be analyzed is provided by Equifax and contains 305 categories of financial information from more than 11,787,287 unique businesses from 2006 to 2014. The main challenge is how to handle large, noisy real-world datasets. Among the three techniques, the results show that the Logistic Regression Method is the best in terms of predictive accuracy and type I errors

    Counting the Impossible: Sampling and Modeling to Achieve a Large State Homeless Count

    Objective: Using inferential statistics, we develop estimates of the homeless population of a geographically large and economically diverse state -- Georgia. Methods: Multiple independent data sources (2000 U.S. Census, the 2006 Georgia County Guide, Georgia Chamber of Commerce) were used to develop clusters of the 150 Georgia counties. These clusters were then used as strata for stratified sampling. Homeless counts were conducted within the sampled counties, allowing multiple regression models to be developed to generate predictions of homeless persons by county. Results: In response to a mandate from the US Department of Housing and Urban Development (HUD), the State of Georgia provided an estimate of its unsheltered homeless population of 12,058 using mathematically validated estimation techniques. Conclusions: The use of statistical estimation techniques allowed the State of Georgia to meet the HUD mandate while saving the taxpayers of Georgia millions of dollars compared with a complete state homeless census

    Application of Isotonic Regression in Predicting Business Risk Scores

    An isotonic regression model fits an isotonic function of the explanatory variables to estimate the expectation of the response variable. In other words, as the function increases, the estimated expectation of the response must be non-decreasing. With this characteristic, isotonic regression could be a suitable option to analyze and predict business risk scores. A current challenge of isotonic regression is the decrease in performance when the model is fitted to a large data set, e.g., one with more than four or five dimensions. This paper attempts to apply isotonic regression models to the prediction of business risk scores using a large data set – approximately 50 numeric variables and 24 million observations. Evaluations are based on comparing the new models with a traditional logistic regression model built for the same data set. The primary finding is that isotonic regression using distance aggregate functions does not outperform logistic regression. The performance gap is narrow, however, suggesting that isotonic regression may still be used if necessary, since it may achieve better convergence speed in massive data sets
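
The non-decreasing constraint described in this abstract can be illustrated with a minimal one-dimensional sketch of the pool-adjacent-violators algorithm (PAVA). This is not the paper's method: the abstract refers to multi-dimensional data and distance aggregate functions, which are not reproduced here. The function name `pava` and the sample values are assumptions made purely for illustration.

```python
def pava(y, w=None):
    """Least-squares non-decreasing fit to y (pool adjacent violators).

    Illustrative 1-D sketch only; `y` is assumed to be already ordered
    by the explanatory variable.
    """
    n = len(y)
    if w is None:
        w = [1.0] * n
    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while a block's mean exceeds its successor's.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand the block means back to one fitted value per observation.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical responses, ordered by a single explanatory variable.
print(pava([1.0, 3.0, 2.0, 4.0, 3.5, 5.0]))
# -> [1.0, 2.5, 2.5, 3.75, 3.75, 5.0]
```

Adjacent values that violate the ordering (3.0 then 2.0, and 4.0 then 3.5) are pooled to their mean, so the fitted sequence never decreases.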

    The Validity of Online Patient Ratings of Physicians

    Background: Information from ratings sites is increasingly informing patient decisions related to health care and the selection of physicians. Objective: The current study sought to determine the validity of online patient ratings of physicians through comparison with physician peer review. Methods: We extracted 223,715 reviews of 41,104 physicians from 10 of the largest cities in the United States, including 1142 physicians listed as “America’s Top Doctors” through physician peer review. Differences in mean online patient ratings were tested between physicians who were listed and those who were not. Results: Overall, no differences were found in online patient ratings based upon physician peer review status. However, statistical differences were found for four specialties (family medicine, allergists, internal medicine, and pediatrics), with online patient ratings significantly higher for those physicians listed as a peer-reviewed “Top Doctor” versus those who were not. Conclusions: The results of this large-scale study indicate that while online patient ratings are consistent with physician peer review for four nonsurgical, primarily in-office specializations, patient ratings were not consistent with physician peer review for specializations like anesthesiology. This result indicates that the validity of patient ratings varies by medical specialization

    The expanded view of individualism and collectivism: One, two, or four dimensions?

    Recent research analyzing and discussing cultural differences has employed a combination of five major dimensions: individualism–collectivism, power distance, uncertainty avoidance, femininity–masculinity (gender role differentiation), and long-term orientation. Among these dimensions, individualism–collectivism has received the most attention. Chronologically, this cultural attribute has been regarded as one, then two, and more recently, four dimensions of horizontal and vertical individualism and collectivism. However, research on this issue has not been conclusive and some have argued against this expansion. The current study attempts to explain and clarify this discussion by using a shortened version of the scale developed by Singelis et al. (1995; Horizontal and vertical dimensions of individualism and collectivism: a theoretical and measurement refinement. Cross-Cultural Research 29(3): 240–275). Our analysis of aggregate data from 802 respondents from nine countries supports the expanded view. Data aggregation was based on the Mindscape Theory, which proposes inter- and intracultural heterogeneity. This finding is reassuring to scholars who have been using the shortened version of the instrument because confirmatory factor analysis indicated its validity. The findings of the present study clarify some apparent ambiguity in recent research in specifying some cultures, such as India, Israel, and Spain, as individualist or collectivist. By separating the four constructs, more nuanced classification is possible. Also, such a distinction enables us to entertain such concepts as the Mindscape Theory, which proposes a unique intracultural and transcultural heterogeneity that does not stereotype the whole culture as either individualist or collectivist

    A Comparison of Decision Tree with Logistic Regression Model for Prediction of Worst Non-Financial Payment Status in Commercial Credit

    Credit risk prediction is an important problem in the financial services domain. While machine learning techniques such as Support Vector Machines and Neural Networks have been used for improved predictive modeling, the outcomes of such models are not readily explainable and, therefore, difficult to apply within financial regulations. In contrast, Decision Trees are easy to explain, and provide an easy to interpret visualization of model decisions. The aim of this paper is to predict worst non-financial payment status among businesses, and evaluate decision tree model performance against traditional Logistic Regression model for this task. The dataset for analysis is provided by Equifax and includes over 300 potential predictors from more than 11 million unique businesses. After a data discovery phase, including imputation, cleaning, and transforming potential predictors, Decision Tree and Logistic Regression models were built on the same finalized analysis dataset. Evaluating the models based on ROC index, and Kolmogorov-Smirnov statistic, Decision Tree performed as well as the Logistic Regression model
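
The two evaluation measures named in this abstract can be sketched as follows. This is a minimal illustration, assuming that "ROC index" means the area under the ROC curve (AUC) and that higher scores indicate the positive class; the function name `ks_and_auc` and the sample data are hypothetical and do not come from the paper.

```python
def ks_and_auc(scores, labels):
    """Return (KS statistic, AUC) for binary labels coded 0/1.

    Assumes higher scores indicate the positive class (label 1).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # AUC as the fraction of (positive, negative) pairs ranked correctly;
    # tied pairs count as half a correct ranking.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    # KS as the largest gap between the two empirical score CDFs.
    ks = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        ks = max(ks, abs(cdf_pos - cdf_neg))
    return ks, auc


# Hypothetical model scores and true outcomes.
ks, auc = ks_and_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(ks, auc)  # -> 0.5 0.75
```

Both measures depend only on how the scores rank the two classes, so they can be applied to the outputs of a decision tree and a logistic regression alike.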

    An overview of the extratropical storm tracks in CMIP6 historical simulations

    The representation of the winter and summer extratropical storm tracks in both hemispheres is evaluated in detail for the available models in the 6th phase of the Coupled Model Intercomparison Project (CMIP6). The state of the storm tracks from 1979-2014 is compared to that in ERA5 using a Lagrangian objective cyclone tracking algorithm. It is found that the main biases present in the previous generation of models (CMIP5) still persist, albeit to a lesser extent. The equatorward bias around the SH is much reduced and there appears to be some improvement in mean biases with the higher resolution models, such as the zonal tilt of the North Atlantic storm track. Low resolution models have a tendency to under-estimate the frequency of high intensity cyclones with all models simulating a peak intensity that is too low for cyclones in the SH. Explosively developing cyclones are under-estimated across all ocean basins and in both hemispheres. In particular the models struggle to capture the rapid deepening required for these cyclones. For all measures, the CMIP6 models exhibit an overall improvement compared to the previous generation of CMIP5 models. In the NH most improvements can be attributed to increased horizontal resolution, whereas in the SH the impact of resolution is less apparent and any improvements are likely a result of improved model physics